SHOPCORE Portfolio Analysis

Author

Will Sigal

Published

June 5, 2025

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns   
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import gaussian_kde
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, adjusted_mutual_info_score
from yellowbrick.cluster import SilhouetteVisualizer
from plotly.offline import init_notebook_mode, iplot


# Initialize Plotly for notebook integration
#init_notebook_mode(connected=True)

# Set Plotly configuration for better embedding
#import plotly.io as pio
#pio.templates.default = "plotly"
#pio.renderers.default = "notebook_connected"

df_merged = pd.read_csv("/Users/willsigal/Desktop/Work/Shopcore/properties_with_all_metrics.csv")

Analysis

  • This is a demonstration of the CRE analysis I can create for acquisitions. I began by compiling a comprehensive dataset of a CRE firm’s portfolio - in this instance ShopCore, a subsidiary of Blackstone. Using web scraping combined with various APIs, I gathered data points for each property. The resulting dataset combines foot-traffic analytics, detailed customer demographic information (see Persona.ai), and census data.

  • By integrating these data sources, I developed a classification model that goes beyond traditional industry standards such as the ICSC classification system. The resulting model provides insights tailored specifically to the visitation and demographic profiles unique to ShopCore’s portfolio.

The application of this classification system is twofold:

  1. Portfolio Insights: Rapidly gain a deeper understanding of each property’s unique market position and visitor profile within the existing portfolio.

  2. Acquisition and Strategic Benchmarking: Efficiently evaluate potential acquisitions by comparing their profiles to existing high-performing properties. For example, if you were looking at a property in Bronzeville, Chicago anchored by Jewel Osco and complemented by tenants like Culvers and Snipes, the model could classify it as a hypothetical “Type 4” property that is 70% similar to an existing asset such as Vista Shopping Center (a sketch of this workflow follows below).
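
  • To make the second use case concrete, below is a minimal, purely hypothetical sketch of how a candidate acquisition could be scored against the models built later in this report. The candidate values are placeholders, and fitted_scaler / fitted_kmeans stand in for the StandardScaler and KMeans objects fit on the portfolio's census metrics in Section 2.
Code
import pandas as pd

# Hypothetical candidate acquisition (placeholder values; the columns must
# match the demographic metrics the portfolio models were fit on)
candidate = pd.DataFrame([{
    'population': 250_000, 'median_income': 62_000, 'median_age': 34.0,
    'pct_white': 18.0, 'pct_black': 65.0, 'pct_asian': 4.0, 'pct_hispanic': 10.0,
    'pct_college_plus': 38.0, 'pct_homeowners': 42.0, 'median_house_value': 310_000,
    'trade_area_population_density': 9_500, 'trade_area_sq_mi': 12.5,
}])

# Classification: fitted_scaler and fitted_kmeans stand in for the
# StandardScaler and KMeans objects fit on the portfolio in Section 2
candidate_type = fitted_kmeans.predict(fitted_scaler.transform(candidate))[0]
print(f"Candidate classified as Type {candidate_type}")

# Benchmarking: with the candidate's full metric set filled in, the
# calculate_property_similarity() function from the similarity-score section
# could then score it against each row of df_merged to surface the closest comps
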

About This Report (if viewing this as an HTML file)
  1. Navigate through the document using the table of contents on the left
  2. Most of the plots are interactive, so feel free to explore the data
  3. The analysis is a draft; it could be made more accurate and informative with access to additional data beyond what I was able to find on the web

Section 1: Analyzing the Portfolio

The Top 10 Performing Centers in 2024 by Visits per Square Foot

  • The graph below shows the top 10 performing centers in 2024 by visits per square foot.
  • We can see that Plaza Pacoima, Stony Creek Promenade, and Rosewood Shopping Center are the top performers with roughly 29, 25, and 23 visits per square foot respectively. After that, the numbers drop off to around 20 visits per square foot.
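
  • For reference, visits per square foot is simply annual visits divided by gross leasable area; recomputing it from the Plaza Pacoima figures in Table 1 below reproduces the reported value.
Code
# Visits per SF = annual visits / overall GLA (SF), e.g. for Plaza Pacoima (Table 1)
visits_2024 = 5_984_549
gla_sf = 203_743
print(round(visits_2024 / gla_sf, 2))  # 29.37
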
Code
# Get top 10 performing centers in 2024
top_10_2024 = df_merged.nlargest(10, 'visits_per_sqft_2024')

# Create interactive bar chart
fig = px.bar(
    top_10_2024,
    x='Property Name',
    y='visits_per_sqft_2024',
    title='Top 10 Performing Centers by Visits per Square Foot (2024)',
    labels={
        'Property Name': 'Shopping Center',
        'visits_per_sqft_2024': 'Visits per Square Foot'
    },
    hover_data={
        'Property Name': True,
        'visits_per_sqft_2024': ':.2f',
        'Overall GLA (SF)': True,
        'visits_2024': True
    }
)

# Customize the layout
fig.update_layout(
    xaxis_tickangle=-45,
    showlegend=False,
    height=600,
    margin=dict(t=50, b=100, l=50, r=50)
)

# Add hover template
fig.update_traces(
    hovertemplate="<b>%{x}</b><br>" +
                  "Visits per SF: %{y:.2f}<br>" +
                  "Total SF: %{customdata[0]:,.0f}<br>" +
                  "Total Visits: %{customdata[1]:,.0f}<br>" +
                  "<extra></extra>"
)

# Display the plot
fig.show()

# Create a professional table for top performers
display_cols = ['Property Name', 'visits_per_sqft_2024', 'Overall GLA (SF)', 'visits_2024']
styled_table = top_10_2024[display_cols].copy()

# Rename columns for display
styled_table.columns = ['Property Name', 'Visits per Sq.Ft.', 'GLA (SF)', 'Total Visits (2024)']

# Apply styling for a professional look
styled_table = styled_table.style.format({
    'Visits per Sq.Ft.': '{:.2f}',
    'GLA (SF)': '{:,.0f}',
    'Total Visits (2024)': '{:,.0f}'
}).set_caption('Top 10 Performing Centers by Visits per Square Foot (2024)').set_table_styles([
    {'selector': 'caption', 'props': [('font-weight', 'bold'), ('font-size', '1.2em'), ('text-align', 'center')]},
    {'selector': 'th', 'props': [('background-color', '#0066cc'), ('color', 'white'), ('font-weight', 'bold'), ('text-align', 'center')]},
    {'selector': 'td', 'props': [('text-align', 'center')]},
    {'selector': 'tr:nth-of-type(even)', 'props': [('background-color', '#f2f2f2')]}
]).hide(axis='index')

# Display the styled table
display(styled_table)
Table 1: Top 10 Performing Centers by Visits per Square Foot (2024)
Property Name Visits per Sq.Ft. GLA (SF) Total Visits (2024)
PLAZA PACOIMA 29.37 203,743 5,984,549
STONY CREEK PROMENADE 25.14 157,647 3,963,282
ROSEWOOD SHOPPING CENTER 23.27 36,887 858,232
VISTA SHOPPING CENTER 20.85 88,699 1,849,305
KEARNY SQUARE 20.62 138,860 2,862,892
OAK PARK VILLAGE 20.60 64,659 1,332,072
NORTHEAST TOWER CENTER 19.13 301,209 5,763,103
BARCLAY CROSSING 18.98 54,958 1,043,286
UNIVERSITY PALMS 18.81 99,172 1,865,200
STONEHILL & DEL OBISPO 18.38 52,675 967,915

Looking at the Distribution of Visits per Square Foot in 2024

  • Instead of looking at the top 10 performing centers, I wanted to examine the distribution of visits per square foot in 2024. This will help us understand the range of performance in the portfolio. Are there outliers? And how centered is the distribution?

  • The mean visits per square foot is 14.2 and the median is 13.7. Because the mean exceeds the median, the distribution is right-skewed: a few strong performers pull the mean up, and more properties fall below the mean than above it.

  • We can see that Rosewood Shopping Center and Plaza Pacoima are performing well above the mean, while Streets of Woodfield and Neapolitan Way are performing far below it.
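
  • A quick way to quantify this skew (a minimal check, assuming df_merged is loaded as in the setup cell):
Code
# Compare mean vs. median and compute the sample skewness of visits per SF
vps = df_merged['visits_per_sqft_2024'].dropna()
print(f"mean={vps.mean():.2f}, median={vps.median():.2f}, skewness={vps.skew():.2f}")
# mean > median together with a positive skewness coefficient indicates right skew
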

Code
# Create KDE plot with individual points
fig = go.Figure()

# Add KDE curve
kde_x = np.linspace(df_merged['visits_per_sqft_2024'].min(), 
                    df_merged['visits_per_sqft_2024'].max(), 
                    100)
kde = gaussian_kde(df_merged['visits_per_sqft_2024'].dropna())
kde_y = kde(kde_x)

fig.add_trace(go.Scatter(
    x=kde_x,
    y=kde_y,
    mode='lines',
    name='KDE',
    line=dict(color='blue', width=2)
))

# Add individual points
fig.add_trace(go.Scatter(
    x=df_merged['visits_per_sqft_2024'],
    y=np.zeros_like(df_merged['visits_per_sqft_2024']),  # Place points at y=0
    mode='markers',
    name='Properties',
    marker=dict(
        size=8,
        color='red',
        opacity=0.6
    ),
    hovertext=df_merged['Property Name'],
    hoverinfo='text+x'
))

# Update layout
fig.update_layout(
    title='Distribution of Visits per Square Foot (2024)',
    xaxis_title='Visits per Square Foot',
    yaxis_title='Density',
    showlegend=True,
    height=600,
    hovermode='closest'
)

# Add hover template for points
fig.update_traces(
    hovertemplate="<b>%{hovertext}</b><br>" +
                  "Visits per SF: %{x:.2f}<br>" +
                  "<extra></extra>",
    selector=dict(mode='markers')
)

# Display the plot
fig.show()

# Create a professional table for summary statistics
stats_df = pd.DataFrame(df_merged['visits_per_sqft_2024'].describe()).reset_index()
stats_df.columns = ['Statistic', 'Value']
stats_df['Value'] = stats_df['Value'].round(2)

# Ensure median is included in the statistics
if 'median' not in stats_df['Statistic'].values:
    median_value = df_merged['visits_per_sqft_2024'].median()
    median_row = pd.DataFrame({'Statistic': ['median'], 'Value': [round(median_value, 2)]})
    # Insert median after mean (which is typically at index 1)
    stats_df = pd.concat([stats_df.iloc[:2], median_row, stats_df.iloc[2:]]).reset_index(drop=True)

# Create a styled HTML table
stats_html = """
<div style='max-width: 600px; margin: 20px auto;'>
    <h4 style='text-align: center; margin-bottom: 15px;'>Summary Statistics: Visits per Square Foot (2024)</h4>
    <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
        <thead>
            <tr style='background-color: #f2f2f2;'>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: left;'>Statistic</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: right;'>Value</th>
            </tr>
        </thead>
        <tbody>
"""

# Add rows to the table
for _, row in stats_df.iterrows():
    stats_html += f"""
            <tr>
                <td style='padding: 8px; border: 1px solid #ddd; text-align: left;'><b>{row['Statistic']}</b></td>
                <td style='padding: 8px; border: 1px solid #ddd; text-align: right;'>{row['Value']:,.2f}</td>
            </tr>
    """

stats_html += """
        </tbody>
    </table>
</div>
"""

# Display the HTML table
from IPython.display import HTML
display(HTML(stats_html))

Summary Statistics: Visits per Square Foot (2024)

Statistic Value
count 58.00
mean 14.19
median 13.68
std 4.53
min 1.52
25% 11.85
50% 13.68
75% 15.73
max 29.37

Looking at the Change in Visits Between 2022 and 2024

  • Moving from a snapshot of 2024 performance to performance over time, we now compare visits in 2022 to visits in 2024. This lets us isolate the top and bottom performers in terms of visit growth and decline.
    • The change can be driven by occupancy changes, market conditions, or shifts in consumer behavior.
Code
# Calculate the change in visits between 2022 and 2024
df_merged['visits_change'] = df_merged['visits_2024'] - df_merged['visits_2022']
df_merged['visits_change_pct'] = ((df_merged['visits_2024'] - df_merged['visits_2022']) / df_merged['visits_2022'] * 100).round(2)

# Get top and bottom 5 performers based on absolute change
top_5_growth = df_merged.nlargest(5, 'visits_change')
bottom_5_decline = df_merged.nsmallest(5, 'visits_change')

# Figure 1: Top 5 Properties by Visit Growth
fig_growth = go.Figure()

fig_growth.add_trace(
    go.Bar(
        x=top_5_growth['Property Name'],
        y=top_5_growth['visits_change'],
        name='Growth',
        marker_color='#2ecc71',
        hovertemplate="<b>%{x}</b><br>" +
                     "Change in Visits: %{y:,.0f}<br>" +
                     "2022 Visits: %{customdata[0]:,.0f}<br>" +
                     "2024 Visits: %{customdata[1]:,.0f}<br>" +
                     "Growth: %{customdata[2]}%<br>" +
                     "<extra></extra>",
        customdata=np.column_stack((
            top_5_growth['visits_2022'],
            top_5_growth['visits_2024'],
            top_5_growth['visits_change_pct']
        ))
    )
)

fig_growth.update_layout(
    title=dict(
        text='Top 5 Properties by Visit Growth (2022-2024)',
        x=0.5,
        y=0.95,
        xanchor='center',
        yanchor='top',
        font=dict(size=18)
    ),
    height=500,
    width=700,
    showlegend=False,
    template='plotly_white',
    margin=dict(t=100, b=100, l=80, r=50)
)

fig_growth.update_xaxes(tickangle=-45, title_text="Property Name", title_standoff=15)
fig_growth.update_yaxes(title_text="Change in Number of Visits", title_standoff=15)

# Display the growth plot
fig_growth.show()

# Figure 2: Top 5 Properties by Visit Decline
fig_decline = go.Figure()

fig_decline.add_trace(
    go.Bar(
        x=bottom_5_decline['Property Name'],
        y=bottom_5_decline['visits_change'],
        name='Decline',
        marker_color='#e74c3c',
        hovertemplate="<b>%{x}</b><br>" +
                     "Change in Visits: %{y:,.0f}<br>" +
                     "2022 Visits: %{customdata[0]:,.0f}<br>" +
                     "2024 Visits: %{customdata[1]:,.0f}<br>" +
                     "Decline: %{customdata[2]}%<br>" +
                     "<extra></extra>",
        customdata=np.column_stack((
            bottom_5_decline['visits_2022'],
            bottom_5_decline['visits_2024'],
            bottom_5_decline['visits_change_pct']
        ))
    )
)

fig_decline.update_layout(
    title=dict(
        text='Top 5 Properties by Visit Decline (2022-2024)',
        x=0.5,
        y=0.95,
        xanchor='center',
        yanchor='top',
        font=dict(size=18)
    ),
    height=500,
    width=700,
    showlegend=False,
    template='plotly_white',
    margin=dict(t=100, b=100, l=80, r=50)
)

fig_decline.update_xaxes(tickangle=-45, title_text="Property Name", title_standoff=15)
fig_decline.update_yaxes(title_text="Change in Number of Visits", title_standoff=15)

# Display the decline plot
fig_decline.show()
# Create styled HTML tables for visit change data
from IPython.display import HTML

# Format top growth properties table
top_growth_html = """
<div style='max-width: 800px; margin: 20px auto;'>
    <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Top 5 Properties by Visit Growth (2022-2024)</h3>
    <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
        <thead>
            <tr style='background-color: #3498db; color: white;'>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property Name</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>2022 Visits</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>2024 Visits</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Change</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>% Change</th>
            </tr>
        </thead>
        <tbody>
"""

for _, row in top_5_growth.iterrows():
    top_growth_html += f"""
        <tr style='background-color: #f9f9f9;'>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['Property Name']}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['visits_2022']:,.0f}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['visits_2024']:,.0f}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd; color: green;'>{row['visits_change']:+,.0f}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd; color: green;'>{row['visits_change_pct']:+.1f}%</td>
        </tr>
    """

top_growth_html += """
        </tbody>
    </table>
</div>
"""

# Format bottom decline properties table
bottom_decline_html = """
<div style='max-width: 800px; margin: 20px auto;'>
    <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Top 5 Properties by Visit Decline (2022-2024)</h3>
    <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
        <thead>
            <tr style='background-color: #e74c3c; color: white;'>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property Name</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>2022 Visits</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>2024 Visits</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Change</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>% Change</th>
            </tr>
        </thead>
        <tbody>
"""

for _, row in bottom_5_decline.iterrows():
    bottom_decline_html += f"""
        <tr style='background-color: #f9f9f9;'>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['Property Name']}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['visits_2022']:,.0f}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['visits_2024']:,.0f}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd; color: red;'>{row['visits_change']:+,.0f}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd; color: red;'>{row['visits_change_pct']:+.1f}%</td>
        </tr>
    """

bottom_decline_html += """
        </tbody>
    </table>
</div>
"""

# Format overall statistics table
overall_stats_html = f"""
<div style='max-width: 800px; margin: 20px auto;'>
    <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Overall Visit Change Statistics (2022-2024)</h3>
    <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
        <tbody>
            <tr style='background-color: #f2f2f2;'>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd; font-weight: bold;'>Average Change</td>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd;'>{df_merged['visits_change'].mean():,.0f} visits</td>
            </tr>
            <tr>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd; font-weight: bold;'>Median Change</td>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd;'>{df_merged['visits_change'].median():,.0f} visits</td>
            </tr>
            <tr style='background-color: #f2f2f2;'>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd; font-weight: bold;'>Properties with Growth</td>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd;'>{(df_merged['visits_change'] > 0).sum():,}</td>
            </tr>
            <tr>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd; font-weight: bold;'>Properties with Decline</td>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd;'>{(df_merged['visits_change'] < 0).sum():,}</td>
            </tr>
        </tbody>
    </table>
</div>
"""

# Display all tables
display(HTML(top_growth_html + bottom_decline_html + overall_stats_html))

Top 5 Properties by Visit Growth (2022-2024)

Property Name 2022 Visits 2024 Visits Change % Change
WESTWOOD VILLAGE 3,637,616 4,612,160 +974,544 +26.8%
THE SHOPS AT SKYVIEW 4,864,060 5,473,364 +609,304 +12.5%
STREETS OF WOODFIELD 3,852,468 4,415,083 +562,615 +14.6%
BAKERS CENTRE 3,122,542 3,504,967 +382,425 +12.2%
CANARSIE PLAZA 3,343,802 3,711,930 +368,128 +11.0%

Top 5 Properties by Visit Decline (2022-2024)

Property Name 2022 Visits 2024 Visits Change % Change
TOWNE CENTER EAST 2,500,961 2,059,712 -441,249 -17.6%
PARKWAY CENTRE 1,012,845 593,583 -419,262 -41.4%
PLAZA PACOIMA 6,340,670 5,984,549 -356,121 -5.6%
MIDWAY MARKET 1,196,359 1,043,799 -152,560 -12.8%
NEAPOLITAN WAY 1,113,515 1,002,344 -111,171 -10.0%

Overall Visit Change Statistics (2022-2024)

Average Change 75,213 visits
Median Change 45,690 visits
Properties with Growth 39
Properties with Decline 19

Analyzing the Trade Area Size

  • Let’s examine the distribution of our trade areas. This metric helps us understand the size of the area each shopping center serves.

  • We can see that trade area size is right-skewed, with a dense concentration of properties whose trade areas are around 25 sq. miles. However, there are a few extreme outliers with very large trade areas (Streets of Woodfield and One Colorado, each with 100+ sq. mile trade areas).
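
  • One simple way to flag these outliers is the standard 1.5×IQR rule; a minimal sketch assuming df_merged from the setup cell (the cutoff is a general convention, not a threshold used elsewhere in this report):
Code
# Flag properties whose trade area exceeds Q3 + 1.5 * IQR
ta = df_merged['trade_area_sq_mi'].dropna()
q1, q3 = ta.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = df_merged.loc[df_merged['trade_area_sq_mi'] > upper_fence,
                         ['Property Name', 'trade_area_sq_mi']]
print(outliers.sort_values('trade_area_sq_mi', ascending=False))
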

Code
# Create KDE plot 
fig = go.Figure()

# Add KDE curve
kde_x = np.linspace(df_merged['trade_area_sq_mi'].min(), 
                    df_merged['trade_area_sq_mi'].max(), 
                    100)
kde = gaussian_kde(df_merged['trade_area_sq_mi'].dropna())
kde_y = kde(kde_x)

fig.add_trace(go.Scatter(
    x=kde_x,
    y=kde_y,
    mode='lines',
    name='KDE',
    line=dict(color='blue', width=2)
))

# Add individual points
fig.add_trace(go.Scatter(
    x=df_merged['trade_area_sq_mi'],
    y=np.zeros_like(df_merged['trade_area_sq_mi']),  # Place points at y=0
    mode='markers',
    name='Properties',
    marker=dict(
        size=8,
        color='red',
        opacity=0.6
    ),
    hovertext=df_merged['Property Name'],
    hoverinfo='text+x'
))

# Update layout
fig.update_layout(
    title='Distribution of Trade Area Sizes',
    xaxis_title='Trade Area Size (Square Miles)',
    yaxis_title='Density',
    showlegend=True,
    height=600,
    hovermode='closest'
)

# Add hover template for points
fig.update_traces(
    hovertemplate="<b>%{hovertext}</b><br>" +
                  "Trade Area: %{x:.2f} sq mi<br>" +
                  "<extra></extra>",
    selector=dict(mode='markers')
)

# Display the plot
fig.show()

# Create a professional table for summary statistics
stats_df = pd.DataFrame(df_merged['trade_area_sq_mi'].describe()).reset_index()
stats_df.columns = ['Statistic', 'Value']
stats_df['Value'] = stats_df['Value'].round(2)

# Ensure median is included in the statistics
if 'median' not in stats_df['Statistic'].values:
    median_value = df_merged['trade_area_sq_mi'].median()
    median_row = pd.DataFrame({'Statistic': ['median'], 'Value': [round(median_value, 2)]})
    # Insert median after mean (which is typically at index 1)
    stats_df = pd.concat([stats_df.iloc[:2], median_row, stats_df.iloc[2:]]).reset_index(drop=True)

# Create a styled HTML table
stats_html = """
<div style='max-width: 600px; margin: 20px auto;'>
    <h4 style='text-align: center; margin-bottom: 15px;'>Summary Statistics: Trade Area Size (Square Miles)</h4>
    <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
        <thead>
            <tr style='background-color: #f2f2f2;'>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: left;'>Statistic</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: right;'>Value</th>
            </tr>
        </thead>
        <tbody>
"""

# Add rows to the table
for _, row in stats_df.iterrows():
    stats_html += f"""
            <tr>
                <td style='padding: 8px; border: 1px solid #ddd; text-align: left;'><b>{row['Statistic']}</b></td>
                <td style='padding: 8px; border: 1px solid #ddd; text-align: right;'>{row['Value']:,.2f}</td>
            </tr>
    """

stats_html += """
        </tbody>
    </table>
</div>
"""

# Display the HTML table
from IPython.display import HTML
display(HTML(stats_html))

Summary Statistics: Trade Area Size (Square Miles)

Statistic Value
count 60.00
mean 33.84
median 27.05
std 27.60
min 3.93
25% 16.71
50% 27.05
75% 40.45
max 166.41

Section 2: Clustering Analysis to Understand the Segmentation of the Portfolio

  • Here we will use more advanced statistical methods to break down the portfolio into different segments based on the demographic and economic characteristics of the trade area.

  • First, we will reduce our data to the most important components.

  • Then, we will identify common characteristics unique to certain groups in our portfolio and represent them as clusters.

Part 1: Clustering Using Only Census Data

  • The following graphs first show a scatter plot of our properties in a reduced dimensional space, with colors indicating the cluster each was assigned to. Then we show a heatmap of the average demographic and economic characteristics of the properties in each cluster, expressed as z-scores (standard deviations from the mean across clusters). Finally, we show a summary table of the characteristics of each cluster.

  • Analysis: With the census data alone, the properties split into only 2 groups. The cluster summary shows strong variation between the groups (one cluster is the lower-density, higher-income, higher-education group), but we may want to include more information to break the portfolio into finer segments.

Code
# Select demographic columns
demographic_cols = ['population', 'median_income', 'median_age', 'pct_white', 
                   'pct_black', 'pct_asian', 'pct_hispanic', 'pct_college_plus', 
                   'pct_homeowners', 'median_house_value', 'trade_area_population_density', 'trade_area_sq_mi']

# Prepare data for clustering
X = df_merged[demographic_cols].copy()
X = X.fillna(X.mean())  # Fill missing values with mean
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine optimal number of PCA components
pca_full = PCA()
pca_full.fit(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Find optimal number of components using 80% cumulative variance
n_components_80 = np.argmax(cumulative_variance >= 0.8) + 1

# Perform PCA with optimal components
pca = PCA(n_components=n_components_80)
X_pca = pca.fit_transform(X_scaled)

# Determine optimal number of clusters using multiple metrics
K = range(2, 11)
metrics = {
    'Silhouette': [],
    'Calinski-Harabasz': [],
    'Davies-Bouldin': []
}

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    
    # Calculate metrics
    metrics['Silhouette'].append(silhouette_score(X_scaled, labels))
    metrics['Calinski-Harabasz'].append(calinski_harabasz_score(X_scaled, labels))
    metrics['Davies-Bouldin'].append(davies_bouldin_score(X_scaled, labels))

# Find optimal number of clusters using combined metrics
silhouette_opt = K[np.argmax(metrics['Silhouette'])]
calinski_opt = K[np.argmax(metrics['Calinski-Harabasz'])]
davies_opt = K[np.argmin(metrics['Davies-Bouldin'])]

# Use the most common optimal k (majority vote across the three metrics)
optimal_k = max(set([silhouette_opt, calinski_opt, davies_opt]), 
                key=[silhouette_opt, calinski_opt, davies_opt].count)

# Perform K-means clustering with optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df_merged['Cluster'] = kmeans.fit_predict(X_scaled)

# Create visualization of clusters in PCA space
fig = px.scatter(
    x=X_pca[:, 0],
    y=X_pca[:, 1],
    color=df_merged['Cluster'].astype(str),
    title='Property Clusters in PCA Space',
    labels={'x': f'First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)',
            'y': f'Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)',
            'color': 'Cluster'},
    hover_data={'Property': df_merged['Property Name']}
)

fig.update_layout(
    template='plotly_white',
    height=600,
    width=700,
    showlegend=True,
    legend_title_text='Cluster',
    font=dict(family="Arial, sans-serif", size=12),
    margin=dict(l=50, r=50, t=80, b=50)
)

fig.show()

# Create cluster profile visualization
cluster_profiles = df_merged.groupby('Cluster')[demographic_cols].mean()

# Calculate z-scores for each demographic variable across clusters
cluster_profiles_z = cluster_profiles.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

fig = go.Figure(data=go.Heatmap(
    z=cluster_profiles_z.values,
    x=[col.replace('_', ' ').title() for col in demographic_cols],
    y=[f'Cluster {i}' for i in range(optimal_k)],
    colorscale='RdBu_r',
    zmid=0,
    text=np.round(cluster_profiles_z.values, 2),
    texttemplate='%{text}',
    textfont={"size": 10},
    hoverongaps=False,
    hovertemplate="<b>%{y}</b><br>" +
                 "<b>%{x}</b>: %{z:.2f} σ<br>" +
                 "<extra></extra>"
))

fig.update_layout(
    title='Cluster Demographic Profiles (Z-Scores)',
    xaxis_title='Demographic Variables',
    yaxis_title='Cluster',
    height=600,
    width=700,
    template='plotly_white',
    font=dict(family="Arial, sans-serif", size=12),
    margin=dict(l=50, r=50, t=80, b=50)
)

fig.show()

# Create a summary table of cluster characteristics
cluster_summary = pd.DataFrame()
for i in range(optimal_k):
    cluster_properties = df_merged[df_merged['Cluster'] == i]
    cluster_summary = pd.concat([cluster_summary, pd.DataFrame({
        'Cluster': [f'Cluster {i}'],
        'Size': [len(cluster_properties)],
        'Avg Visits 2024': [cluster_properties['visits_2024'].mean()],
        'Avg Visits/SF': [cluster_properties['visits_per_sqft_2024'].mean()],
        'Median Income': [cluster_properties['median_income'].mean()],
        'Population': [cluster_properties['population'].mean()],
        'College Education %': [cluster_properties['pct_college_plus'].mean()],
        'Median House Value': [cluster_properties['median_house_value'].mean()]
    })])

# Display formatted cluster summary
cluster_summary_html = """
<div style='max-width: 1000px; margin: 20px auto;'>
    <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Cluster Summary</h3>
    <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
        <thead>
            <tr style='background-color: #f2f2f2;'>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Cluster</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Size</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Avg Visits 2024</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Avg Visits/SF</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Median Income</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Population</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>College Education %</th>
                <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Median House Value</th>
            </tr>
        </thead>
        <tbody>
"""

for _, row in cluster_summary.iterrows():
    cluster_summary_html += f"""
        <tr>
            <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'><b>{row['Cluster']}</b></td>
            <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['Size']:,.0f}</td>
            <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['Avg Visits 2024']:,.0f}</td>
            <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['Avg Visits/SF']:.2f}</td>
            <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>${row['Median Income']:,.0f}</td>
            <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['Population']:,.0f}</td>
            <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['College Education %']:.1f}%</td>
            <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>${row['Median House Value']:,.0f}</td>
        </tr>
    """

cluster_summary_html += """
        </tbody>
    </table>
</div>
"""

display(HTML(cluster_summary_html))

Cluster Summary

Cluster Size Avg Visits 2024 Avg Visits/SF Median Income Population College Education % Median House Value
Cluster 0 17 2,950,211 15.18 $69,357 432,738 56.9% $522,980
Cluster 1 43 1,578,605 13.78 $99,516 103,865 72.9% $485,289
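
  • To see which census variables drive the reduced dimensions plotted above, the PCA loadings can be inspected directly. This is a small add-on sketch that reuses the pca and demographic_cols objects from the code cell above:
Code
# Loadings: weight of each demographic variable on each retained component
loadings = pd.DataFrame(
    pca.components_.T,
    index=demographic_cols,
    columns=[f'PC{i+1}' for i in range(pca.n_components_)]
)

# Variable with the largest absolute weight on each component
print(loadings.abs().idxmax())

# Full weights on the first component, largest magnitude first
print(loadings['PC1'].sort_values(key=abs, ascending=False))
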
Code
# Select segmentation columns
seg_cols = [col for col in df_merged.columns if col.startswith('seg_')]

# Prepare data for clustering
X_seg = df_merged[seg_cols].copy()
X_seg = X_seg.fillna(X_seg.mean())  # Fill missing values with mean

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_seg)

# Import required libraries
import umap
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, adjusted_mutual_info_score
import warnings

# Suppress UMAP warnings
warnings.filterwarnings("ignore", category=UserWarning, module="umap")

# Function to evaluate clustering
def evaluate_clustering(X, n_components_range, n_neighbors_range):
    best_silhouette = -1
    best_config = None
    best_labels = None
    best_umap = None
    
    for n_neighbors in n_neighbors_range:
        # Apply UMAP with explicit n_jobs parameter
        reducer = umap.UMAP(n_neighbors=n_neighbors, random_state=42, n_jobs=1)
        X_umap = reducer.fit_transform(X)
        
        for n_components in n_components_range:
            # Fit GMM
            gmm = GaussianMixture(n_components=n_components, random_state=42)
            labels = gmm.fit_predict(X_umap)
            
            # Calculate metrics
            silhouette = silhouette_score(X_umap, labels)
            
            if silhouette > best_silhouette:
                best_silhouette = silhouette
                best_config = (n_neighbors, n_components)
                best_labels = labels
                best_umap = X_umap
    
    return best_silhouette, best_config, best_labels, best_umap

# Define parameter ranges
n_components_range = range(2, 11)
n_neighbors_range = [5, 10, 15, 20, 30]

# Evaluate clustering
best_silhouette, best_config, best_labels, best_umap = evaluate_clustering(
    X_scaled, n_components_range, n_neighbors_range
)

# print(f"\nBest configuration:")
# print(f"UMAP n_neighbors: {best_config[0]}")
# print(f"Number of GMM components: {best_config[1]}")
# print(f"Silhouette score: {best_silhouette:.3f}")

# Store best clustering results
df_merged['Seg_Cluster'] = best_labels

# Create UMAP visualization
fig = px.scatter(
    x=best_umap[:, 0],
    y=best_umap[:, 1],
    color=df_merged['Seg_Cluster'],
    title='Property Clusters in UMAP Space (Segmentation Variables)',
    labels={'x': 'UMAP Dimension 1',
            'y': 'UMAP Dimension 2',
            'color': 'Cluster'},
    hover_data={'Property Name': df_merged['Property Name']}
)

fig.update_layout(
    template='plotly_white',
    height=800,
    width=1200,
    showlegend=True,
    legend_title_text='Cluster'
)

# fig.show()

# Create heatmap of cluster characteristics with z-scores
cluster_profiles_seg = df_merged.groupby('Seg_Cluster')[seg_cols].mean()
cluster_profiles_seg_z = cluster_profiles_seg.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

fig = go.Figure(data=go.Heatmap(
    z=cluster_profiles_seg_z.values,
    x=seg_cols,
    y=[f'Cluster {i}' for i in range(best_config[1])],
    colorscale='RdBu_r',
    zmid=0,
    text=np.round(cluster_profiles_seg_z.values, 2),
    texttemplate='%{text}',
    textfont={"size": 10},
    hoverongaps=False,
    hovertemplate="<b>%{y}</b><br>" +
                 "<b>%{x}</b>: %{z:.2f} σ<br>" +
                 "<extra></extra>"
))

fig.update_layout(
    title='Cluster Segmentation Profiles (Z-Scores)',
    xaxis_title='Segmentation Variables',
    yaxis_title='Cluster',
    height=600,
    width=700,
    template='plotly_white'
)

# fig.show()

# Print cluster statistics
# print("\nCluster Size Distribution:")
# print(df_merged['Seg_Cluster'].value_counts().sort_index())

# print("\nCluster Characteristics (Z-Scores):")
cluster_stats_seg_z = pd.DataFrame({
    col: (df_merged[col] - df_merged[col].mean()) / df_merged[col].std()
    for col in seg_cols
})
cluster_stats_seg_z['Seg_Cluster'] = df_merged['Seg_Cluster']
cluster_stats_seg_z = cluster_stats_seg_z.groupby('Seg_Cluster').agg(['mean', 'std']).round(2)
# print(cluster_stats_seg_z)

# Compare segmentation clusters with demographic clusters
# print("\nCross-tabulation of Demographic and Segmentation Clusters:")
cross_tab = pd.crosstab(df_merged['Cluster'], df_merged['Seg_Cluster'])
# print(cross_tab)

# Calculate adjusted mutual information between the two clusterings
ami_score = adjusted_mutual_info_score(df_merged['Cluster'], df_merged['Seg_Cluster'])
# print(f"\nAdjusted Mutual Information between clusterings: {ami_score:.3f}")

# Create violin plots for key segmentation variables
#fig = make_subplots(
#    rows=2, cols=2,
#    subplot_titles=('Top 4 Most Variable Segments by Cluster',
#                   'Next 4 Most Variable Segments by Cluster',
                   #'Next 4 Most Variable Segments by Cluster',
                   #'Remaining Segments by Cluster'),
  #  specs=[[{"type": "violin"}, {"type": "violin"}],
          # [{"type": "violin"}, {"type": "violin"}]])

# Get most variable segments
variances = cluster_profiles_seg_z.var().sort_values(ascending=False)
top_segments = variances.index.tolist()

# Plot segments in groups of 4
#for i in range(4):
    #row = i // 2 + 1
    #col = i % 2 + 1
    #start_idx = i * 4
    #end_idx = min(start_idx + 4, len(top_segments))
    
    #for segment in top_segments[start_idx:end_idx]:
        #fig.add_trace(
        #    go.Violin(
        #        y=df_merged[segment],
        #        x=df_merged['Seg_Cluster'],
        #        name=segment,
        #        box_visible=True,
             #   meanline_visible=True,
              #  points="outliers"
            #),
        #    row=row, col=col
        #)

#fig.update_layout(
#    title='Distribution of Key Segmentation Variables by Cluster',
#    height=1000,
#    width=1200,
#    showlegend=True,
#    template='plotly_white'
#)

#fig.show()

Part 2: Clustering Using All Metrics

  • Here we cluster the properties using all of the metrics available to us: census demographics, visit metrics, and the visitor segmentation variables.
  • The approach is the same as before - scale the data, reduce its dimensionality (here with UMAP), and fit a Gaussian mixture model, selecting the configuration with the best silhouette score.
  • The figures show a mapping of our properties in 3D space, a heatmap of the average value of each metric in each cluster, and a summary table of cluster characteristics along with example properties for each cluster.
  • Analysis: We can see 4 distinct clusters. Cluster 0 has a high population, a high percentage of melting-pot families, and a relatively large trade area. Cluster 1 also has a large trade area but a lower population and a higher proportion of blue-collar and rural visitors. Cluster 2 has a moderate population, a moderate trade area, and an older demographic. Cluster 3 is highly educated and wealthy but has a relatively small trade area.
Code
# Select all relevant columns for clustering
all_metrics_cols = [
    # Demographic columns
    'population', 'median_income', 'median_age', 'pct_white', 
    'pct_black', 'pct_asian', 'pct_hispanic', 'pct_college_plus', 
    'pct_homeowners', 'median_house_value', 'trade_area_population_density', 'trade_area_sq_mi',
    # Visit metrics
    'visits_per_sqft_2022', 'visits_per_sqft_2023', 'visits_per_sqft_2024',
    # Segmentation columns
    *[col for col in df_merged.columns if col.startswith('seg_')]
]

# Prepare data for clustering
X_all = df_merged[all_metrics_cols].copy()
X_all = X_all.fillna(X_all.mean())  # Fill missing values with mean

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all)

# Function to evaluate clustering
def evaluate_clustering(X, n_components_range, n_neighbors_range):
    best_silhouette = -1
    best_config = None
    best_labels = None
    best_metrics = None
    
    for n_neighbors in n_neighbors_range:
        # Apply UMAP
        reducer = umap.UMAP(n_neighbors=n_neighbors, random_state=42)
        X_umap = reducer.fit_transform(X)
        
        for n_components in n_components_range:
            # Fit GMM
            gmm = GaussianMixture(n_components=n_components, random_state=42)
            labels = gmm.fit_predict(X_umap)
            
            # Calculate metrics
            silhouette = silhouette_score(X_umap, labels)
            calinski = calinski_harabasz_score(X_umap, labels)
            davies = davies_bouldin_score(X_umap, labels)
            
            if silhouette > best_silhouette:
                best_silhouette = silhouette
                best_config = (n_neighbors, n_components)
                best_labels = labels
                best_umap = X_umap
                best_metrics = {'silhouette': silhouette, 'calinski': calinski, 'davies': davies}
    
    return best_silhouette, best_config, best_labels, best_umap, best_metrics

# Define parameter ranges
n_components_range = range(3, 11)
n_neighbors_range = [5, 10, 15, 20, 30]

# Evaluate clustering
best_silhouette, best_config, best_labels, best_umap, best_metrics = evaluate_clustering(
    X_scaled, n_components_range, n_neighbors_range
)

# Store best clustering results
df_merged['All_Metrics_Cluster'] = best_labels

# Create 3D UMAP visualization
reducer_3d = umap.UMAP(n_neighbors=best_config[0], n_components=3, random_state=42)
umap_3d = reducer_3d.fit_transform(X_scaled)

fig = px.scatter_3d(
    x=umap_3d[:, 0],
    y=umap_3d[:, 1],
    z=umap_3d[:, 2],
    color=df_merged['All_Metrics_Cluster'],
    title='Property Clusters in 3D UMAP Space (All Metrics)',
    labels={'x': 'UMAP Dimension 1',
            'y': 'UMAP Dimension 2',
            'z': 'UMAP Dimension 3',
            'color': 'Cluster'},
    hover_data={'Property Name': df_merged['Property Name']}
)

fig.update_layout(
    template='plotly_white',
    height=600,
    width=700,
    showlegend=True,
    legend_title_text='Cluster',
    scene=dict(
        xaxis_title='UMAP Dimension 1',
        yaxis_title='UMAP Dimension 2',
        zaxis_title='UMAP Dimension 3'
    )
)
fig.write_html("umap_3d.html")

fig.show()

# Create heatmap of cluster characteristics with z-scores
cluster_profiles = df_merged.groupby('All_Metrics_Cluster')[all_metrics_cols].mean()
cluster_profiles_z = cluster_profiles.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

# Create a more focused heatmap with only the most important variables
# Select top variables by variance across clusters
top_vars = cluster_profiles_z.var(axis=0).nlargest(15).index.tolist()
focused_profiles_z = cluster_profiles_z[top_vars]

fig = go.Figure(data=go.Heatmap(
    z=focused_profiles_z.values,
    x=top_vars,
    y=[f'Cluster {i}' for i in range(best_config[1])],
    colorscale='RdBu_r',
    zmid=0,
    text=np.round(focused_profiles_z.values, 2),
    texttemplate='%{text}',
    textfont={"size": 10},
    hoverongaps=False,
    hovertemplate="<b>%{y}</b><br>" +
                 "<b>%{x}</b>: %{z:.2f} σ<br>" +
                 "<extra></extra>"
))

fig.update_layout(
    title='Cluster Profiles - Key Differentiating Factors (Z-Scores)',
    xaxis_title='Metrics',
    yaxis_title='Cluster',
    height=600,
    width=700,
    template='plotly_white'
)

fig.show()

# Create a summary table of cluster characteristics
cluster_summary_html = """
<div style='max-width: 1000px; margin: 20px auto;'>
    <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Cluster Summary</h3>
    <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
        <thead>
            <tr style='background-color: #3498db; color: white;'>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Cluster</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Size</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Key Characteristics</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Example Properties</th>
            </tr>
        </thead>
        <tbody>
"""

for cluster_id in range(best_config[1]):
    # Get properties in this cluster
    cluster_properties = df_merged[df_merged['All_Metrics_Cluster'] == cluster_id]
    
    # Get top characteristics (highest z-scores)
    top_chars = cluster_profiles_z.loc[cluster_id].nlargest(5)
    top_chars_formatted = "<br>".join([f"<b>{col}</b>: {val:.2f}σ" for col, val in top_chars.items()])
    
    # Get example properties (random sample of 3)
    example_props = cluster_properties.sample(min(3, len(cluster_properties)))['Property Name'].tolist()
    example_props_formatted = "<br>".join(example_props)
    
    # Add row to table
    cluster_summary_html += f"""
        <tr style='background-color: {"#f9f9f9" if cluster_id % 2 == 0 else "white"};'>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>Cluster {cluster_id}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{len(cluster_properties)}</td>
            <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{top_chars_formatted}</td>
            <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{example_props_formatted}</td>
        </tr>
    """

cluster_summary_html += """
        </tbody>
    </table>
</div>
"""

# Display the HTML table
from IPython.display import HTML
display(HTML(cluster_summary_html))

Cluster Summary

Cluster Size Key Characteristics Example Properties
Cluster 0 21 population: 1.49σ
seg_p_melting_pot_families: 1.49σ
seg_i_young_urban_singles: 1.48σ
seg_q_city_hopefuls: 1.47σ
trade_area_population_density: 1.47σ
ROSEWOOD SHOPPING CENTER
VISTA SHOPPING CENTER
UNIVERSITY PALMS
Cluster 1 22 trade_area_sq_mi: 1.35σ
seg_c_upper_suburban_diverse_families: 1.18σ
seg_f_blue_collar_suburbs: 1.04σ
seg_l_rural_high_income: 0.92σ
pct_black: 0.78σ
PARKWAY CENTRE
THOMPSON BRIDGE COMMONS
FAIRFIELD SHOPPING CENTER
Cluster 2 7 seg_k_budget_boomers: 1.50σ
seg_j_sunset_boomers: 1.40σ
seg_o_small_town: 1.40σ
median_age: 1.33σ
seg_n_rural_resilience: 1.11σ
BERRY TOWN CENTER
CROSSROADS MARKET
SHOPPES OF PARKLAND
Cluster 3 10 median_house_value: 1.46σ
seg_a_ultra_wealthy_families: 1.37σ
pct_asian: 1.36σ
median_income: 1.31σ
pct_college_plus: 1.22σ
HERITAGE HILL
FOUR POINTS SHOPPING CENTRE
STONEHILL & DEL OBISPO
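
  • A natural follow-up is to connect these clusters back to foot-traffic performance. A brief sketch, assuming the All_Metrics_Cluster labels created in the code above:
Code
# Average performance and footprint by all-metrics cluster
cluster_performance = (
    df_merged
    .groupby('All_Metrics_Cluster')[['visits_per_sqft_2024', 'visits_2024', 'trade_area_sq_mi']]
    .mean()
    .round(2)
)
print(cluster_performance)
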

Part 3: Similarity Scores

  • Here we calculate similarity scores between all properties in the dataset. Each category of metric (demographic, visit, and segmentation) receives a 1/3 weight, and a similarity score is computed for every property pair. The weights are easy to change if we want to place more emphasis on certain metrics.
  • The table below lists the 10 most similar property pairs; the same matrix can also be queried for the top 10 most similar properties to any single property (see the sketch after the table).
  • The advantage of similarity scores is that they enable quick identification of comparable properties. This is particularly useful for benchmarking new acquisitions or evaluating performance against other properties in the portfolio.
Code
# Define weights for different metric categories
WEIGHTS = {
    'demographic': 0.3333,
    'visit': 0.3333,
    'segmentation': 0.3333
}

# Define sub-weights for demographic metrics
DEMOGRAPHIC_WEIGHTS = {
    'population': 0.15,
    'trade_area_sq_mi': 0.15,
    'trade_area_population_density': 0.1,
    'median_income': 0.1,
    'pct_homeowners': 0.1,
    'pct_white': 0.1,
    'pct_black': 0.1,
    'pct_asian': 0.1,
    'pct_hispanic': 0.1
}

# Define sub-weights for visit metrics
VISIT_WEIGHTS = {
    'visits_per_sqft_2024': 1
}

def calculate_metric_similarity(val1, val2, metric_type):
    """Calculate similarity between two values for a given metric type"""
    if pd.isna(val1) or pd.isna(val2):
        return 0
    
    # For percentage metrics, use absolute difference
    if metric_type in ['pct_white', 'pct_black', 'pct_asian', 'pct_hispanic', 'pct_college_plus']:
        return 100 - abs(val1 - val2)
    
    # For monetary and population metrics, use relative difference
    elif metric_type in ['population', 'median_income']:
        max_val = max(abs(val1), abs(val2))
        if max_val == 0:
            return 100
        return 100 * (1 - abs(val1 - val2) / max_val)
    
    # For visit metrics, use relative difference
    elif 'visits_per_sqft' in metric_type:
        max_val = max(abs(val1), abs(val2))
        if max_val == 0:
            return 100
        return 100 * (1 - abs(val1 - val2) / max_val)
    
    # For other metrics, use absolute difference
    else:
        max_val = max(abs(val1), abs(val2))
        if max_val == 0:
            return 100
        return 100 * (1 - abs(val1 - val2) / max_val)

def calculate_property_similarity(prop1, prop2):
    """Calculate overall similarity between two properties"""
    # Demographic similarity
    demographic_similarity = sum(
        DEMOGRAPHIC_WEIGHTS[metric] * calculate_metric_similarity(
            prop1[metric], prop2[metric], metric
        )
        for metric in DEMOGRAPHIC_WEIGHTS.keys()
    )
    
    # Visit metrics similarity
    visit_similarity = sum(
        VISIT_WEIGHTS[metric] * calculate_metric_similarity(
            prop1[metric], prop2[metric], metric
        )
        for metric in VISIT_WEIGHTS.keys()
    )
    
    # Segmentation metrics similarity
    seg_metrics = [col for col in df_merged.columns if col.startswith('seg_')]
    seg_similarity = sum(
        calculate_metric_similarity(prop1[metric], prop2[metric], metric)
        for metric in seg_metrics
    ) / len(seg_metrics) if seg_metrics else 0
    
    # Calculate weighted average
    total_similarity = (
        WEIGHTS['demographic'] * demographic_similarity +
        WEIGHTS['visit'] * visit_similarity +
        WEIGHTS['segmentation'] * seg_similarity
    )
    
    return round(total_similarity, 2)

# Create a matrix of similarity scores
property_names = df_merged['Property Name'].tolist()
n_properties = len(property_names)
similarity_matrix = np.zeros((n_properties, n_properties))

# Calculate similarity scores for all pairs
for i in range(n_properties):
    for j in range(i + 1, n_properties):
        similarity = calculate_property_similarity(
            df_merged.iloc[i],
            df_merged.iloc[j]
        )
        similarity_matrix[i, j] = similarity
        similarity_matrix[j, i] = similarity  # Matrix is symmetric

# Convert to DataFrame for easier manipulation
similarity_df = pd.DataFrame(
    similarity_matrix,
    index=property_names,
    columns=property_names
)

# Create a list of all property pairs with their similarity scores
property_pairs = []
for i in range(n_properties):
    for j in range(i + 1, n_properties):
        property_pairs.append({
            'Property 1': property_names[i],
            'Property 2': property_names[j],
            'Similarity Score': similarity_matrix[i, j]
        })

# Convert to DataFrame and sort by similarity score
property_pairs_df = pd.DataFrame(property_pairs)
property_pairs_df = property_pairs_df.sort_values('Similarity Score', ascending=False)

# Create a styled HTML table for top property pairs
from IPython.display import HTML

# Format the top 10 most similar property pairs
similarity_html = """
<div style='max-width: 800px; margin: 20px auto;'>
    <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Most Similar Property Pairs</h3>
    <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
        <thead>
            <tr style='background-color: #3498db; color: white;'>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property 1</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property 2</th>
                <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Similarity Score</th>
            </tr>
        </thead>
        <tbody>
"""

# Add top 10 rows to the table
for i, row in property_pairs_df.head(10).iterrows():
    similarity_html += f"""
        <tr style='background-color: {"#f9f9f9" if i % 2 == 0 else "white"};'>
            <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{row['Property 1']}</td>
            <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{row['Property 2']}</td>
            <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['Similarity Score']:.2f}</td>
        </tr>
    """

similarity_html += """
        </tbody>
    </table>
</div>
"""

# Display the HTML table
display(HTML(similarity_html))

Most Similar Property Pairs

Property 1 Property 2 Similarity Score
SANDY PLAINS EXCHANGE WADE GREEN VILLAGE 76.41
SOUTHGATE VILLAGE THE MARKET AT VICTORY VILLAGE 75.61
FAIRFIELD SHOPPING CENTER SOUTHGATE VILLAGE 75.28
BRAWLEY COMMONS ROSEDALE SHOPPING CENTER 75.21
SPRING HILL PLAZA WEST TOWN MARKET 74.23
PARKWAY CENTRE SOUTHGATE VILLAGE 73.99
ROSEDALE SHOPPING CENTER THE MARKET AT VICTORY VILLAGE 73.91
PUERTA REAL PLAZA SPRINGBROOK PRAIRIE PAVILION 73.85
SANDY PLAINS EXCHANGE WEST TOWN MARKET 73.53
ROSEDALE SHOPPING CENTER SOUTHGATE VILLAGE 73.08
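
  • For the per-property view referenced above, the symmetric similarity_df built in the code cell can be queried directly. A minimal sketch, using Vista Shopping Center purely as an example; to emphasize particular metrics instead, the WEIGHTS and sub-weight dictionaries can be re-scaled and the matrix recomputed.
Code
def top_similar(property_name, n=10):
    """Return the n most similar portfolio properties to the given property."""
    # Self-similarity sits on the diagonal (stored as 0), so drop it
    scores = similarity_df[property_name].drop(property_name)
    return scores.sort_values(ascending=False).head(n)

print(top_similar('VISTA SHOPPING CENTER'))
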

Thank You for Your Time Reviewing This!

  • Email: wsigal@uchicago.edu
  • Phone: 818-268-3572